-
-
Notifications
You must be signed in to change notification settings - Fork 19.1k
BUG: avoid float upcast when mixing signed/unsigned ints in isin #62608
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: avoid float upcast when mixing signed/unsigned ints in isin #62608
Conversation
pandas/core/algorithms.py
Outdated
# If the dtypes differ and either side is unsigned integer, | ||
# prefer object dtype to avoid unsafe upcast to float64 that | ||
# can lose precision for large 64-bit integers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this change the performance when values
and comps
are both integer like and fit within an integer type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fix takes the conservative approach of converting values to [object] when mixing signed and unsigned integer dtypes to ensure correctness. This preserves exact integer equality but may be slower for very large arrays compared to a numeric-only path.
This trade-off is favoring correctness over the rare case of very large arrays with mixed signed/unsigned ints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another approach for better performance can be to remove the earlier asymmetric object-conversion block and add a fast, safe numeric path that correctly handles signed/unsigned mixes without converting to object.
The idea is to use masked uint64 lookups to avoid float casts and preserve performance. I’ll place this fast-path after the comps_array extraction and before the common-type coercion, by mapping signed int64 and uint64 values into the wider unsigned space and performing hashtable lookups on uint64.
This will involve changes roughly around lines algorithms.py+6-16. I’ll also run the new tests afterward to verify behavior. Not sure if that breaks something, should I try?
Removed comments from the test case for isin method.
Clarified comments regarding fast-path application for integer widths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the ping. You'll also want to add a whatsnew note for this to v3.0.0
if ( | ||
values.dtype.kind in "iu" | ||
and comps_array.dtype.kind in "iu" | ||
# Only apply fast-path for 64-bit integer widths to avoid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you expand on this comment? I don't fully understand what the surprising behavior is, and I'm worried its a red herring
values_u = values.astype("uint64", copy=False) | ||
comps_u = comps_array.astype("uint64", copy=False) | ||
return htable.ismember(comps_u, values_u) | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We almost never toss a base Exception in the code base. What is this trying to catch here?
…ts added)
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Summary
Fix incorrect Series.isin results when comparing signed int64 values with uint64 values that are not equal. Previously, mixing signed and unsigned 64-bit integers could trigger a numeric common-type coercion to float64 which may lose precision and produce false positives. This change prevents that unsafe upcast by preferring an object-based comparison when signed and unsigned integer types are mixed.
Root cause
When isin attempted to find a common numeric dtype between comps (left side) and values (right side), mixing signed int64 with uint64 could lead to casting both sides to float64. Converting large 64-bit integers to float64 loses precision and can make two distinct integers compare as equal.
What I changed
[algorithms.py]:
Adjusted the condition used before converting values to an object array so that when dtypes differ and either side is an unsigned integer, [values] is converted to object (i.e., Python-level equality / hashtable lookup) instead of numeric coercion. This makes the mixed signed/unsigned decision symmetric and avoids unsafe float upcasts.
[test_isin.py]
Added test_isin_int64_vs_uint64_mismatch which reproduces the reported case and asserts the correct False result.